Transformers and GPT: a Technical Introduction

The original paper “Attention Is All You Need”
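
The core operation of that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Below is a minimal NumPy sketch of just that equation; the names Q, K, V, and d_k come from the paper, while the function name, shapes, and toy inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of value vectors

# Toy usage: 4 tokens, d_k = d_v = 8 (illustrative sizes only)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```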

Transformers from scratch, and a Hacker News discussion of it

Harvard NLP’s working Python transformer implementation (The Annotated Transformer)


lucidrains/x-transformers: “A concise but fully-featured transformer, complete with a set of promising experimental features from various papers.” (Python)
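
For a taste of the library, here is a minimal GPT-style decoder-only model written in the style of the repository’s README examples; the hyperparameter values are illustrative, and the README remains the authoritative reference for the API.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# GPT-style decoder-only language model; all sizes below are illustrative.
model = TransformerWrapper(
    num_tokens=20000,      # vocabulary size
    max_seq_len=1024,      # maximum context length
    attn_layers=Decoder(
        dim=512,           # model width
        depth=6,           # number of transformer blocks
        heads=8,           # attention heads per block
    ),
)

tokens = torch.randint(0, 20000, (1, 1024))   # dummy batch of token ids
logits = model(tokens)                        # (1, 1024, 20000) next-token logits
```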


Sebastian Ruder is a Google Research Scientist who blogs about natural language processing and machine learning.

See his “An overview of gradient descent optimization algorithms” (2016):

> Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as Momentum, Adagrad, and Adam actually work.
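
To give a flavor of what the post covers, here is a minimal sketch of two of those update rules (SGD with momentum and Adam) applied to a toy quadratic objective. The objective and all hyperparameter values are assumptions for illustration, not recommendations from the post.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy objective f(theta) = 0.5 * ||theta||^2."""
    return theta

def sgd_momentum(theta, lr=0.1, gamma=0.9, steps=100):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)       # accumulate a velocity term
        theta = theta - v
    return theta

def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(theta)                   # first-moment (mean) estimate
    v = np.zeros_like(theta)                   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta0 = np.array([3.0, -2.0])
print(sgd_momentum(theta0))   # both drive theta toward the minimum at the origin
print(adam(theta0))
```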